Many X Variables

Module 4.2: Binary Variables and Interactions

Alex Cardazzi

Old Dominion University

Guidance Counselors

To start this portion of the module, we are going to revisit our beloved SAT - GPA data (data, documentation).

Last module, we began by estimating the relationship between HS GPA and the sum of SAT Percentiles. We then split the sample into males and females, and estimated two models separately for each group. Let’s re-do these three models so we can have them as baselines.

Code
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
df <- df[df$hs_gpa <= 4,]
r1 <- lm(sat_sum ~ hs_gpa, df)
r2 <- lm(sat_sum ~ hs_gpa, df, subset = df$sex == 1)
r3 <- lm(sat_sum ~ hs_gpa, df, subset = df$sex == 2)

library("modelsummary")
regz <- list(`All Students` = r1,
             `Female` = r2,
             `Male` = r3)
coefz <- c("hs_gpa" = "High School GPA",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Effect of GPA on SAT Scores",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of GPA on SAT Scores

                   All Students   Female       Male
High School GPA    11.538***      11.392***    13.715***
                   (0.753)        (0.999)      (1.079)
Constant           66.468***      70.196***    55.851***
                   (2.440)        (3.167)      (3.579)
Num.Obs.           999            515          484
R2                 0.191          0.202        0.251

Guidance Counselors

Next, let’s recreate some visualizations of these regression lines.

Code
plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = scales::alpha(ifelse(df$sex == 1, "tomato", "dodgerblue"), 0.4),
     ylab = "SAT Percentile Sum",
     xlab = "High School GPA")
abline(r2, col = "tomato", lwd = 3)
abline(r3, col = "dodgerblue", lwd = 3)
abline(r1, col = "black", lwd = 2, lty = 2)
legend("bottomright", bty = "n", lty = c(2, 1, 1),
       col = c("black", "tomato", "dodgerblue"),
       legend = c("All", "Female", "Male"), lwd = 2)
Plot

Splitting our sample like this is convenient because it allows for a lot of flexibility – each group gets their own parameters. However, remember: the more parameters you need to estimate, the more difficult inference becomes.1

If we wanted to reduce the number of parameters in our model (which is currently four: two intercepts and two slopes), we could simply not differentiate by gender. By doing this, we would accept a bit of bias in exchange for a reduction in variance, and we’d only have two parameters (intercept and slope) to estimate. Is there a compromise we can make, and maybe estimate just three parameters?

Guidance Counselors

The short answer is: yes! Let’s explore this a bit.

In the two-model situation, we have two intercepts and two slopes. In order to reduce the number of parameters, we would need to make an assumption about the setting we’re working with: we could have male and female students share either a slope or an intercept. If they share a slope, we would only have to estimate a single slope parameter plus two intercept parameters. Looking at the two regression lines, they appear pretty parallel, so the shared-slope assumption seems reasonable.

We are going to estimate the following regression equation:

\[\text{SAT}_i = \beta_0 + \beta_1 \times \text{Female}_i + \beta_2 \times \text{GPA}_i + \epsilon_i\]

where \(\text{Female}_i\) is a binary variable equal to 1 if the observation is for a female student and 0 otherwise, and \(\text{GPA}_i\) is the student’s high school GPA.

Guidance Counselors

Before we estimate the model, let’s run through some of the algebra of this model. Suppose we are working with a male student, and they have a 3.0 GPA. What do we expect their SAT to be?

\[\text{SAT}_i = \widehat{\beta}_0 + (\widehat{\beta}_1 \times 0) + (\widehat{\beta}_2 \times 3) = \widehat{\beta}_0 + 3\widehat{\beta}_2\]

What about a female student with the same GPA?

\[\text{SAT}_i = \widehat{\beta}_0 + (\widehat{\beta}_1 \times 1) + (\widehat{\beta}_2 \times 3) = \widehat{\beta}_0 + \widehat{\beta}_1 + 3\widehat{\beta}_2\]

The only difference in expected SAT scores is \(\widehat{\beta}_1\), and this is independent of \(\text{GPA}\). In other words, we would interpret \(\beta_1\) as an intercept shifter/modifier, or the average difference between female and male scores.

Let’s estimate the model and plot the resulting lines.

Guidance Counselors

Code
# create binary indicator
df$female <- ifelse(df$sex == 1, 1, 0)
r4 <- lm(sat_sum ~ female + hs_gpa, df)
summary(r4)
Output

Call:
lm(formula = sat_sum ~ female + hs_gpa, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.642  -8.401  -0.083   8.540  34.198 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  59.9776     2.4687  24.295   <2e-16 ***
female        6.8978     0.7927   8.702   <2e-16 ***
hs_gpa       12.4561     0.7335  16.982   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.39 on 996 degrees of freedom
Multiple R-squared:  0.248, Adjusted R-squared:  0.2465 
F-statistic: 164.2 on 2 and 996 DF,  p-value: < 2.2e-16
  1. The intercept is the expected sum of SAT percentiles when hs_gpa and female are equal to zero. Therefore, the interpretation for this coefficient is now the expected sum of SAT percentiles for a male student with a GPA of zero.
  2. Keeping hs_gpa at zero, the expected sum of SAT percentiles for female students is 59.98 + 6.9 = 66.88.
  3. Increasing hs_gpa has the same effect for both groups given our initial assumption. A one-unit increase in hs_gpa leads to a 12.46 increase in the expected sum of SAT percentiles.
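As a quick sanity check on the algebra from earlier, we can plug a 3.0 GPA into the fitted model by hand. This is a sketch using the rounded estimates from the output above; with the data loaded, predict() on r4 would give the same answers.

```r
# Rounded estimates from the r4 output above
b0 <- 59.9776   # (Intercept)
b1 <- 6.8978    # female
b2 <- 12.4561   # hs_gpa

sat_male   <- b0 + b1*0 + b2*3   # male student with a 3.0 GPA
sat_female <- b0 + b1*1 + b2*3   # female student with a 3.0 GPA
sat_female - sat_male            # equals b1, no matter the GPA
```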

Guidance Counselors

Visualizing these results on a set of axes:

Code
plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = scales::alpha(ifelse(df$sex == 1, "tomato", "dodgerblue"), 0.4),
     ylab = "SAT Percentile Sum",
     xlab = "High School GPA")
abline(a = coef(r4)[1], b = coef(r4)[3], col = "dodgerblue", lwd = 3)
abline(a = sum(coef(r4)[1:2]), b = coef(r4)[3], col = "tomato", lwd = 3)
legend("bottomright", bty = "n", lty = c(1, 1),
       col = c("tomato", "dodgerblue"),
       legend = c("Female", "Male"), lwd = 2)
Plot

High School GPA vs SAT Percentile Sum, colored by reported sex.

Note how these lines are perfectly parallel. This is a result of our assumption that the two groups share the same slope, which does not seem to be a bad assumption in this case. Binary variables like this one, when included in a regression, are called dummy variables.

Now, what if we wanted to allow the slopes to differ but not the intercepts? Well, we can do this too!

Guidance Counselors

To allow for differing slopes by group, we can specify the following functional form:

\[\text{SAT}_i = \beta_0 + (\beta_1 \times \text{GPA}_i) + (\beta_2 \times \text{GPA}_i \times \text{Female}_i) + \epsilon_i\]

If a male student had a GPA of 0, we would expect their SAT to be:

\[\text{SAT}_i = \widehat{\beta}_0 + (\widehat{\beta}_1 \times 0) + (\widehat{\beta}_2 \times 0 \times 0) = \widehat{\beta}_0\]

Similarly, if a female student had a GPA of zero, their expected SAT would be:

\[\text{SAT}_i = \widehat{\beta}_0 + (\widehat{\beta}_1 \times 0) + (\widehat{\beta}_2 \times 0 \times 1) = \widehat{\beta}_0\]

So, in other words, the intercepts would be exactly the same, which matches our assumption.

Guidance Counselors

What if these two students had 3.0 GPAs? How would their expected SAT scores change?

For the male student:

\[\text{SAT}_i = \widehat{\beta}_0 + (\widehat{\beta}_1 \times 3) + (\widehat{\beta}_2 \times 3 \times 0) = \widehat{\beta}_0 + 3\widehat{\beta}_1\]

For the female student:

\[\begin{aligned}\text{SAT}_i = \widehat{\beta}_0 + (\widehat{\beta}_1 \times 3) + (\widehat{\beta}_2 \times 3 \times 1) &= \widehat{\beta}_0 + 3\widehat{\beta}_1 + 3\widehat{\beta}_2 \\&= \widehat{\beta}_0 + 3(\widehat{\beta}_1 + \widehat{\beta}_2)\end{aligned}\]

This makes the difference between the two groups equal to \(3\widehat{\beta}_2\), or \(\text{GPA} \times \widehat{\beta}_2\). If \(\widehat{\beta}_2\) is positive (negative), the difference between the groups would widen (shrink) as \(\text{GPA}\) increased.

Guidance Counselors

To estimate this regression in R, we will write the equation as follows: sat_sum ~ hs_gpa + hs_gpa:female. The colon (:) tells R to multiply the two variables in the model like we have written above.

Code
r5 <- lm(sat_sum ~ hs_gpa + hs_gpa:female, df)
summary(r5)

plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = scales::alpha(ifelse(df$sex == 1, "tomato", "dodgerblue"), 0.4),
     ylab = "SAT Percentile Sum",
     xlab = "High School GPA")
abline(a = coef(r5)[1], b = coef(r5)[2], col = "dodgerblue", lwd = 3)
abline(a = coef(r5)[1], b = sum(coef(r5)[2:3]), col = "tomato", lwd = 3)
legend("bottomright", bty = "n", lty = c(1, 1),
       col = c("tomato", "dodgerblue"),
       legend = c("Female", "Male"), lwd = 2)
Output

Call:
lm(formula = sat_sum ~ hs_gpa + hs_gpa:female, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.884  -8.284  -0.255   8.600  34.832 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    63.9518     2.3804  26.866  < 2e-16 ***
hs_gpa         11.3040     0.7288  15.511  < 2e-16 ***
hs_gpa:female   2.0289     0.2446   8.294 3.54e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.43 on 996 degrees of freedom
Multiple R-squared:  0.2431,    Adjusted R-squared:  0.2415 
F-statistic: 159.9 on 2 and 996 DF,  p-value: < 2.2e-16
Plot

GPA and SAT Percentile Sum with two regression lines that share an intercept but diverge as GPA increases.

Here, the two lines certainly diverge as hs_gpa increases, which matches up with the regression coefficient (2.03). To interpret this coefficient: for each one-unit increase in GPA, female students’ expected SAT scores grow about two points more than male students’.

The coefficient for the product (multiplication) of two variables in a regression is called an interaction term.
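We can make the divergence concrete by recovering each group’s slope from the fitted coefficients. A sketch, assuming r5 has been estimated as in the code above:

```r
b <- coef(r5)
slope_male   <- b["hs_gpa"]                       # about 11.30
slope_female <- b["hs_gpa"] + b["hs_gpa:female"]  # about 13.33

# predicted female-male gap at a given GPA:
3 * b["hs_gpa:female"]   # at GPA = 3, roughly a 6.1 point gap
```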

Guidance Counselors

Econometricians like to use dummy variables and interaction terms because they allow us to consider the setting’s inherent heterogeneity in our models. However, another convenient thing about them is that they allow us to estimate differences.

For example, what if we wanted to know: do female students perform better on SAT exams than male students? Before this module, we might collect SAT scores for female and male students and then run a difference-in-means hypothesis test (from Module 2). If we found a statistically significant difference, however, we could not be sure that the difference was in fact due to gender rather than the female students in the sample simply being smarter overall. By using regression, we can partial out the effect of intelligence (via GPA1) and then re-calculate the difference in means. In fact, our female dummy variable from r4 is exactly that: a difference-in-means test. The difference we found was 6.9, and it was statistically significant even after controlling for hs_gpa.

Guidance Counselors

Now that we’ve estimated these two models with three parameters each, we can combine them into one giant model. This model will look like the following:

\[\begin{aligned}\text{SAT}_i = \ &\beta_0 + (\beta_1 \times \text{Female}_i) + (\beta_2 \times \text{GPA}_i) \\ &+ (\beta_3 \times \text{GPA}_i \times \text{Female}_i) + \epsilon_i\end{aligned}\]

Coefficient Interpretation:

  1. \(\beta_0\) is the expected SAT for male students with a GPA of 0.
  2. \(\beta_1\) is the difference in the expected SAT for female students relative to male students with a GPA of 0.
  3. \(\beta_2\) is the change in expected SAT for male students if they increase their GPA by 1.0.
  4. \(\beta_3\) is the difference in the change in expected SAT for female students relative to male students when they increase their GPA by 1.0.

Guidance Counselors

Let’s take a look at the parameter estimates and compare them to the previous gender-specific models:

Code
r6 <- lm(sat_sum ~ female + hs_gpa + hs_gpa:female, df)
# As a note, we could have used "female*hs_gpa" in the model.
#   "*" is a short-hand way to include both variables
#   by themselves in addition to their interaction.
#   However, when learning, I think it's best to write out
#   the full equation.
regz <- list(`Female` = r2,
             `Male` = r3,
             `All Students` = r6)
coefz <- c("hs_gpa" = "HS GPA",
           "female" = "Female",
           "female:hs_gpa" = "HS GPA x Female",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Effect of GPA on SAT Scores",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of GPA on SAT Scores

                   Female       Male         All Students
HS GPA             11.392***    13.715***    13.715***
                   (0.999)      (1.079)      (1.083)
Female                                       14.345**
                                             (4.782)
HS GPA x Female                              −2.323
                                             (1.471)
Constant           70.196***    55.851***    55.851***
                   (3.167)      (3.579)      (3.593)
Num.Obs.           515          484          999
R2                 0.202        0.251        0.250

Guidance Counselors

Let’s break down the model’s output:

  • First, notice how the coefficients for “HS GPA” and “Constant” in the third column match the coefficients in the second column. This is because those two coefficients in the third column are, in essence, male-specific. In effect, we have matched the estimates from the male-only model!
  • You should also note that even though the coefficients match, the standard errors are slightly larger in the third column. This is due to having to estimate additional parameters.
  • If we replicated the male-specific model, what about the female-specific model? Well, if we add the constant and the female dummy, we get 70.196, which is the constant in the female-specific regression. Similarly, the sum of the HS GPA and HS GPA x Female coefficients equals the HS GPA coefficient in the first column.
  • The benefit of estimating all of these parameters in a single model is that now we can look to see if the differences between coefficients in the models are statistically different. For example, the difference in constants across the models is indeed statistically different according to the stars next to the Female parameter in column 3. We also find that the two slopes across the models are not statistically different.
  • The results of this model suggest that female students do perform better on the SAT than their male counterparts. However, the difference between male and female students is relatively constant at all levels of intelligence (i.e. GPA) given the insignificant interaction term.
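To make this mapping explicit, here is a sketch (assuming r6 has been estimated as in the code above) that rebuilds each group’s intercept and slope from the combined model’s coefficients:

```r
b <- coef(r6)
c(male_intercept   = unname(b["(Intercept)"]),                 # ~55.85
  female_intercept = unname(b["(Intercept)"] + b["female"]),   # ~70.20
  male_slope       = unname(b["hs_gpa"]),                      # ~13.72
  female_slope     = unname(b["hs_gpa"] + b["female:hs_gpa"])) # ~11.39
```

These match the constants and slopes from the two group-specific regressions.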

Guidance Counselors

Initially, we made an arbitrary choice by selecting males to be the “reference” group in our setting. We could have made the exact opposite decision and created a dummy variable for males. We can re-run these models by using a male dummy:

Code
df$male <- ifelse(df$sex == 2, 1, 0)
r7 <- lm(sat_sum ~ male*hs_gpa, df)
regz <- list(`Female` = r2,
             `Male` = r3,
             `All Students (M ref)` = r6,
             `All Students (F ref)` = r7)
coefz <- c("hs_gpa" = "HS GPA",
           "female" = "Sex",
           "female:hs_gpa" = "HS GPA x Sex",
           "male" = "Sex",
           "male:hs_gpa" = "HS GPA x Sex",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Effect of GPA on SAT Scores",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of GPA on SAT Scores

                   Female       Male         All Students (M ref)   All Students (F ref)
HS GPA             11.392***    13.715***    13.715***              11.392***
                   (0.999)      (1.079)      (1.083)                (0.996)
Sex                                          14.345**               −14.345**
                                             (4.782)                (4.782)
HS GPA x Sex                                 −2.323                 2.323
                                             (1.471)                (1.471)
Constant           70.196***    55.851***    55.851***              70.196***
                   (3.167)      (3.579)      (3.593)                (3.155)
Num.Obs.           515          484          999                    999
R2                 0.202        0.251        0.250                  0.250

Guidance Counselors

Numerically, it doesn’t matter which group is the reference group. However, it changes the interpretation of your coefficients, which should be considered carefully.

As a final note, it is important to mention that you cannot include both the Female and Male dummies in the same regression alongside the constant. Together, the two dummies are perfectly collinear with the constant, so the full set of parameters cannot be estimated.
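This collinearity (often called the dummy variable trap) is easy to see: for every student, female + male equals 1, which is exactly the column of ones behind the constant. If you try it anyway, R detects the redundancy and reports NA for one of the dummies. A sketch, assuming df and df$female exist as above:

```r
df$male <- ifelse(df$sex == 2, 1, 0)
all(df$female + df$male == 1)   # TRUE: perfectly collinear with the constant
lm(sat_sum ~ female + male + hs_gpa, df)   # one dummy's coefficient comes back NA
```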

Housing Price Models

Of course we love pretending to be guidance counselors, but we really love being real estate agents. For (maybe?) the last time, let’s read in our ames housing data (data; documentation). We are going to keep all of the variables we did last time plus one more: BsmtFin.Type.1.

Definition/Details of BsmtFin.Type.1
BsmtFin Type 1  (Ordinal): Rating of basement finished area

       GLQ  Good Living Quarters
       ALQ  Average Living Quarters
       BLQ  Below Average Living Quarters   
       Rec  Average Rec Room
       LwQ  Low Quality
       Unf  Unfinshed
       NA   No Basement
Source: https://jse.amstat.org/v19n3/decock/DataDocumentation.txt

Let’s load in our data and re-name the columns.

Code
ames <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/ames.csv")
keep_columns <- c("price", "area", "Bedroom.AbvGr", "Year.Built", "Overall.Cond", "BsmtFin.Type.1")
ames <- ames[,keep_columns]
colnames(ames) <- c("price", "sqft", "bedrooms", "yr_built", "condition", "bsmnt")
ames$age <- 2011 - ames$yr_built

Housing Price Models

Let’s take a peek at the bsmnt variable to see what it contains. To do this, we are going to use table(), but add useNA = "always" just in case there are any missing data elements in the column.

Code
table(ames$bsmnt, useNA = "always")
Output

      ALQ  BLQ  GLQ  LwQ  Rec  Unf <NA> 
   1  429  269  859  154  288  851   79 

According to this tabulation, there appear to be 79 observations with missing (NA) values and 1 observation with a blank input. According to the definition above, NA values mean “no basement”.1 Let’s change the data from NA to "No Basement".

Code
ames$bsmnt <- ifelse(is.na(ames$bsmnt) | ames$bsmnt == "", "No Basement", ames$bsmnt)

Housing Price Models

If we were real estate agents, it would be really nice to have some way to quantify this basement variable so we could use it in our model. Otherwise, we might end up like Zillow with a potentially huge omitted variable issue. To start, let’s just create a dummy variable equal to 1 if the property has a basement and a zero otherwise.

Code
ames$has_bsmnt <- ifelse(ames$bsmnt == "No Basement", 0, 1)

Similar to how we controlled for gender differences, which is a categorical variable, we can include this variable in our housing price model. For simplicity, we are going to use only square footage as our other explanatory variable. Students are encouraged to explore the inclusion of other variables on their own.

Housing Price Models

Estimating our model:

Code
reg1 <- lm(log(price) ~ log(sqft) + has_bsmnt, ames)
reg1
Output

Call:
lm(formula = log(price) ~ log(sqft) + has_bsmnt, data = ames)

Coefficients:
(Intercept)    log(sqft)    has_bsmnt  
     5.1596       0.8931       0.3881  

This model suggests that a 1% increase in square footage is associated with a 0.89% increase in expected sale price. In addition, having a basement increases expected price by roughly 38.8%. This is helpful, but there are many different “ratings” of basement quality given in bsmnt that we are ignoring. The lowest quality is “Unf”, or unfinished. Let’s make a dummy for this type of basement and include it in the model.
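One caution worth flagging: in a log-outcome model, a dummy’s coefficient is a log-point change, and reading it directly as a percent is an approximation that degrades as the coefficient grows. The exact percent change is exp(b) - 1:

```r
b <- 0.3881          # has_bsmnt coefficient from above
100 * b              # approximate percent change: 38.8%
100 * (exp(b) - 1)   # exact percent change: about 47.4%
```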

Housing Price Models

Estimating this model:

Code
ames$unfinished <- ifelse(ames$bsmnt == "Unf", 1, 0)
reg2 <- lm(log(price) ~ log(sqft) + has_bsmnt + unfinished, ames)
reg2
Output

Call:
lm(formula = log(price) ~ log(sqft) + has_bsmnt + unfinished, 
    data = ames)

Coefficients:
(Intercept)    log(sqft)    has_bsmnt   unfinished  
     5.0663       0.9062       0.4363      -0.1679  

We find that having a basement, relative to no basement, is worth an expected increase of 43.6% of the sale price. However, if that basement is unfinished, we would expect a 43.6% - 16.8% = 26.8% increase in sale price.

Housing Price Models

This interpretation is a bit clunky because it requires us to add up two coefficients to get our total estimate. Rather, to make interpretation easier, we can redefine has_bsmnt to exclude unfinished basements:

Code
ames$has_bsmnt <- ifelse(ames$bsmnt %in% c("No Basement", "Unf"), 0, 1)
reg2 <- lm(log(price) ~ log(sqft) + has_bsmnt + unfinished, ames)
reg2
Output

Call:
lm(formula = log(price) ~ log(sqft) + has_bsmnt + unfinished, 
    data = ames)

Coefficients:
(Intercept)    log(sqft)    has_bsmnt   unfinished  
     5.0663       0.9062       0.4363       0.2683  

This interpretation is much easier. An unfinished basement, relative to no basement, is worth a 26.8% increase. A finished basement of any rated quality, relative to no basement, is worth a 43.6% increase. Hopefully you can start to see a pattern. If we wanted to estimate the effect of some other basement quality level, we would have to again redefine has_bsmnt and create a new dummy variable. If we take this logic to the finish line, we’d end up with \(N-1\) dummy variables, where \(N\) is the number of possible categories. We subtract one because one group must be the reference group.

Housing Price Models

I’m sure you’ve heard this one before: work smarter, not harder. In programming, they say: a good programmer is a lazy programmer. While one of these is earnest and the other sarcastic, they mean similar things. In our case, making a bunch of dummy variables by hand would be a lot of work! Luckily, R has a lazy (read: efficient) solution for us.

To create many dummies all at once, we can use the factor() command. Moreover, we can even select which level becomes our reference group. To do this in R, we would do the following:

Code
# I am putting a "_" to differentiate between the original variable.
ames$bsmnt_ <- as.factor(ames$bsmnt)
head(ames$bsmnt_)
cat("\n")
ames$bsmnt_ <- relevel(ames$bsmnt_, ref = "No Basement")
head(ames$bsmnt_)
Output
[1] BLQ Rec ALQ ALQ GLQ GLQ
Levels: ALQ BLQ GLQ LwQ No Basement Rec Unf

[1] BLQ Rec ALQ ALQ GLQ GLQ
Levels: No Basement ALQ BLQ GLQ LwQ Rec Unf

Housing Price Models

Then, if we throw this into a regression, R will create all of the dummies for us. I will also include bedrooms and age into this model for completeness.

Code
reg3 <- lm(log(price) ~ log(sqft) + bsmnt_ + bedrooms + age, ames)
summary(reg3)
Output

Call:
lm(formula = log(price) ~ log(sqft) + bsmnt_ + bedrooms + age, 
    data = ames)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.86737 -0.11004  0.00482  0.12182  0.77719 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.8752614  0.1011122   58.11   <2e-16 ***
log(sqft)    0.8568861  0.0146948   58.31   <2e-16 ***
bsmnt_ALQ    0.3529548  0.0242718   14.54   <2e-16 ***
bsmnt_BLQ    0.3285093  0.0253126   12.98   <2e-16 ***
bsmnt_GLQ    0.3932176  0.0240384   16.36   <2e-16 ***
bsmnt_LwQ    0.2831736  0.0274313   10.32   <2e-16 ***
bsmnt_Rec    0.2922540  0.0251362   11.63   <2e-16 ***
bsmnt_Unf    0.2459743  0.0233557   10.53   <2e-16 ***
bedrooms    -0.0694749  0.0055325  -12.56   <2e-16 ***
age         -0.0047742  0.0001461  -32.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1987 on 2920 degrees of freedom
Multiple R-squared:  0.763, Adjusted R-squared:  0.7623 
F-statistic:  1045 on 9 and 2920 DF,  p-value: < 2.2e-16

Housing Price Models

The model spits out a bunch of numbers for bsmnt_, but it’s important to remember that they are all interpreted as having that “type” of basement relative to not having a basement, controlling for the property’s age, size and number of bedrooms. Moreover, we can plot the coefficients and show that they’re generally increasing in the order that we’d expect.

Plot

Coefficient estimates for the type of basement and the effect on housing prices.

Basement Rating Reminder
BsmtFin Type 1  (Ordinal): Rating of basement finished area

       GLQ  Good Living Quarters
       ALQ  Average Living Quarters
       BLQ  Below Average Living Quarters   
       Rec  Average Rec Room
       LwQ  Low Quality
       Unf  Unfinshed
       NA   No Basement
Source: https://jse.amstat.org/v19n3/decock/DataDocumentation.txt

It is important to point out the assumptions of this model. We are assuming that changing the square footage (or age or bedrooms) of a home impacts sale price independently of the type of basement. Of course, we could estimate a fully interacted model, but the output would be huge and uninterpretable.

Fixed Effects

One of the most common ways to use dummy variables is for fixed effects. A common use case for fixed effects is when you are working with a time-by-unit panel. For example: a state-by-year panel.

By including state dummies, or fixed effects, we are removing / controlling for any and all constant (time-invariant) factors. As an example, if the outcome variable in our state-by-year panel is housing prices, we can expect California to have high prices and West Virginia to have low prices. Why? For a variety of reasons: weather and other local amenities, economic factors, etc. To make the states more comparable to one another, we can remove the effect of the idiosyncratic differences intrinsic to each state.

If we also include year fixed effects, we can remove any and all factors that impact every state at the same time. For example, the 2007 financial crisis reduced housing prices across the country. When we include dummies for each year (except one, the reference group), we can remove the influence of these events on our outcome.

Fixed Effects

When using fixed effects, we write our equations a bit differently. If we have a state-by-year panel, you will see equations written like the following:

\[Y_{st} = \beta_s + \beta_t + \beta_1 X_{st} + \beta_2 Z_{st} + \epsilon_{st}\]

where \(\beta_s\) represents state (\(s\)) fixed effects and \(\beta_t\) represents time (\(t\)) fixed effects. Notice how we do not include \(\beta_0\) (the intercept). This is because the fixed effects take the place of the intercept in practice.
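One way to see what fixed effects do is the “within” transformation: regressing on unit dummies yields the same slope as first demeaning \(Y\) and \(X\) within each unit. A sketch with made-up panel data (all names here are illustrative, not from the datasets below):

```r
set.seed(1)
d <- data.frame(state = rep(c("CA", "WV"), each = 10), x = rnorm(20))
d$y <- ifelse(d$state == "CA", 5, 1) + 2*d$x + rnorm(20)

# slope with state dummies (fixed effects):
coef(lm(y ~ x + as.factor(state), d))["x"]

# identical slope after demeaning y and x within state:
d$y_dm <- d$y - ave(d$y, d$state)
d$x_dm <- d$x - ave(d$x, d$state)
coef(lm(y_dm ~ x_dm, d))["x"]
```

The two slope estimates agree exactly; this is the Frisch-Waugh-Lovell logic applied to the dummies.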

Fixed Effects

To demonstrate fixed effects, we are going to compare per-capita personal income and housing price indices for different counties/cities in Hampton Roads over time. Let’s take a look at the data files we have.

First, we have the FIPS codes for each of the areas of interest. FIPS codes are unique identifiers for counties/cities in the United States. The first two digits denote the state and the final three digits denote the county or city within that state.

Code
hr <- read.csv("https://alexcardazzi.github.io/econ311/data/hr_fips.csv")
hr # hr is "hampton roads"
Output
                                    name  fips
1                      Gloucester County 51073
2                   Isle of Wight County 51093
3  James City County + Williamsburg City 51931
4                         Mathews County 51115
5                            York County 51199
6                        Chesapeake City 51550
7                           Hampton City 51650
8                      Newport News City 51700
9                           Norfolk City 51710
10                       Portsmouth City 51740
11                          Suffolk City 51800
12                   Virginia Beach City 51810

Fixed Effects

Of the other two datasets we are going to use, the first contains per-capita personal income while the second contains house price index values. Both contain observations by FIPS code and year.

Code
pi <- read.csv("https://alexcardazzi.github.io/econ311/data/pcpi.csv")
head(pi, 12) # pi is "personal income"
Output
                    series_id value  fips year
1  Per Capita Personal Income  3696 51073 1969
2  Per Capita Personal Income  3916 51073 1970
3  Per Capita Personal Income  4278 51073 1971
4  Per Capita Personal Income  4812 51073 1972
5  Per Capita Personal Income  5115 51073 1973
6  Per Capita Personal Income  5329 51073 1974
7  Per Capita Personal Income  5815 51073 1975
8  Per Capita Personal Income  6493 51073 1976
9  Per Capita Personal Income  6986 51073 1977
10 Per Capita Personal Income  7819 51073 1978
11 Per Capita Personal Income  8297 51073 1979
12 Per Capita Personal Income  9478 51073 1980
Code
hp <- read.csv("https://alexcardazzi.github.io/econ311/data/athpi.csv")
head(hp, 12) # hp is "housing price"
Output
                            series_id  value  fips year
1  All-Transactions House Price Index  83.51 51073 1993
2  All-Transactions House Price Index  83.21 51073 1994
3  All-Transactions House Price Index  85.19 51073 1995
4  All-Transactions House Price Index  86.72 51073 1996
5  All-Transactions House Price Index  88.11 51073 1997
6  All-Transactions House Price Index  92.83 51073 1998
7  All-Transactions House Price Index  95.75 51073 1999
8  All-Transactions House Price Index 100.00 51073 2000
9  All-Transactions House Price Index 104.24 51073 2001
10 All-Transactions House Price Index 109.65 51073 2002
11 All-Transactions House Price Index 115.22 51073 2003
12 All-Transactions House Price Index 131.49 51073 2004

We will need to “merge” these three data files in order to run our regression.

Fixed Effects

To begin merging, we should check which dataset has the most “coverage”. For example, the pi dataset covers the years 1969–2021, whereas hp only covers 1975–2022. This can also be seen graphically:

Code
par(mfrow = c(1, 2))
plot(table(pi$year), las = 1,
     main = "Personal Income",
     xlab = "Year", ylab = "Frequency")
plot(table(hp$year), las = 1,
     main = "Housing Price Index",
     xlab = "Year", ylab = "Frequency")
par(mfrow = c(1, 1))
Plot

Distribution plots of year for both personal income and housing price index.

These figures suggest we should merge into pi, or else we would potentially be dropping data. It’s better to have too much data than too little.

Fixed Effects

Next, we are going to merge in the names from hr into pi so we can have a better way to decipher which FIPS code is which county/city. To do this, we are going to match the entries in fips in pi to the entries in fips in hr. We are going to use match() to help us with this.

match() accepts two arguments. The first is the values to be matched, and the second is the values to match against. The resulting vector tells us the location of each element of the first vector (pi$fips) in the second vector (hr$fips).

As a simple example, consider the following. I have a short vector of colleges/universities, and a second data source that contains the state in which each institution is located. The code below will pull the state for each institution in the first vector from the second data source.

Code
v1 <- c("ODU", "WVU", "RCNJ")
ex <- data.frame(name = c("RCNJ", "RU", "Union", "ODU", "WVU", "WCU"),
                 state = c("NJ", "NJ", "NY", "VA", "WV", "NC"))
the_match <- match(v1, ex$name)
the_match; cat("\n")
ex$state[the_match]
Output
[1] 4 5 1

[1] "VA" "WV" "NJ"

Fixed Effects

Now let’s use the same code to pull the FIPS names from hr.

Code
m <- match(pi$fips, hr$fips)
pi$name <- hr$name[m]
View(pi[,c("value", "fips", "year", "name")])

Fixed Effects

Next, we need to pull in the housing price index data from hp. We can try a similar process:

Code
match(pi$fips, hp$fips) -> m
pi$hp <- hp$value[m]
View(pi[,c("value", "fips", "year", "hp")])

Unfortunately, this did not work as we intended – there should be variation in hp across years. match() always returns the position of the first match. Since we did not specify anything about years, R pulled the first value for each FIPS code, which is clearly not what we’re looking for.
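The first-match behavior is easy to see in a toy example:

```r
match(c("a", "b"), c("a", "a", "b"))
# the second "a" in the target vector is never reached,
# so the result points at positions 1 and 3
```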

Fixed Effects

We need to create a variable that contains information about both FIPS and year. To do this, we are going to use paste(). This function takes two or more vectors and “pastes” them together element-wise into one. As an example, paste(c("ODU", "WVU"), c("Monarchs", "Mountaineers")) returns "ODU Monarchs" and "WVU Mountaineers".

Code
m <- match(paste(pi$fips, pi$year),
           paste(hp$fips, hp$year))
pi$hp <- hp$value[m]
View(pi[,c("value", "fips", "year", "hp")])
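Pasting FIPS and year into a composite key makes each row uniquely matchable. A self-contained sketch of the idea, using hypothetical values:

```r
# Hypothetical mini-panel: one row per FIPS-year.
hp_toy <- data.frame(fips  = c("51710", "51710", "54061", "54061"),
                     year  = c(2000, 2001, 2000, 2001),
                     value = c(100, 105, 90, 92))
want <- data.frame(fips = c("51710", "54061"), year = c(2001, 2001))

# "51710 2001" matches only the row for that county AND that year.
m <- match(paste(want$fips, want$year), paste(hp_toy$fips, hp_toy$year))
hp_toy$value[m] # 105 and 92 -- the correct 2001 values
```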

Fixed Effects

You might notice that there is quite a bit of missing data in the hp column of pi. That is because we have more personal income data than housing price index data. Now that we have the two variables lined up in the same data frame, let’s explore their relationship a bit. First, let’s plot the two variables in a scatterplot:

Code
pi$value000 <- pi$value/1000
plot(pi$value000, pi$hp, las = 1,
     xlab = "Personal Income (in Thousands)", pch = 19,
     ylab = "Housing Price Index (100 is 2000 $'s)",
     col = scales::alpha("mediumseagreen", 0.4))
Plot

Personal Income vs Housing Price Index

Fixed Effects

There appears to be a pretty strong, positive relationship between these variables. Now let’s take a look at what some regressions tell us about the strength of this relationship. First, we are going to estimate a model where housing price index is the outcome and personal income is the explanatory variable. Next, we are going to include FIPS and year fixed effects.

Code
coef(lm(hp ~ value000, pi))[2]; cat("\n")
coef(lm(hp ~ value000 + as.factor(fips) + as.factor(year), pi))[2]
Output
value000 
3.698684 

 value000 
0.4945878 

Why are these two coefficients so different?

Fixed Effects

We have already investigated what happens when we include control variables in our regressions: their effects get removed from both the outcome variable and the other explanatory variables. In this case, time is likely a third variable that drives both housing prices and personal income. Let’s revisit the coefficient bias table. Since time has a positive effect on both housing prices and personal income, the bias is likely positive. In other words, by not accounting for time, the coefficient is being artificially pushed in a positive direction.
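The direction of this bias can be checked with a quick simulation. In the sketch below (all numbers are made up), z plays the role of time: it raises both x and y, so omitting it inflates the coefficient on x above its true value of 0.5.

```r
set.seed(1)
z <- rnorm(1000)                    # third variable ("time")
x <- 2 * z + rnorm(1000)            # x rises with z
y <- 0.5 * x + 3 * z + rnorm(1000)  # true effect of x is 0.5; z also raises y

coef(lm(y ~ x))["x"]     # biased upward, well above 0.5
coef(lm(y ~ x + z))["x"] # close to the true 0.5
```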

Fixed Effects

Let’s take a look at what happens when we partial out the effect of our fixed effects.

We are going to run two regressions with only fixed effects. One of these regressions will have housing price as the outcome variable and the other will have personal income as the outcome. After each estimation, we are going to collect the residuals, which are the leftover variation after removing the influence of the fixed effects. Then, we are going to plot the two sets of residuals against each other and run a third regression between them.

Code
r_hp <- lm(hp ~ as.factor(fips) + as.factor(year), pi, subset = !is.na(pi$hp))
r_pi <- lm(value000 ~ as.factor(fips) + as.factor(year), pi, subset = !is.na(pi$hp))

plot(r_pi$residuals, r_hp$residuals, las = 1)

r_resid <- lm(r_hp$residuals ~ r_pi$residuals)
coef(r_resid)[2]
Output
r_pi$residuals 
     0.4945878 
Plot

Residuals of housing price index vs residuals of personal income.

Fixed Effects

The coefficient from the fixed effects model matches the coefficient from this funky residual model! This is no accident: the Frisch–Waugh–Lovell theorem guarantees that regressing the two sets of residuals on each other yields the same coefficient as the full regression.

The point you should take away is that oftentimes there are unobserved idiosyncratic factors that need to be controlled for. In our example, the coefficient estimate shrank by nearly an order of magnitude, from about 3.7 to about 0.5, after we included fixed effects.
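The theorem holds for any dataset and any set of controls. A quick check using R’s built-in mtcars data (an illustration, not part of our housing example):

```r
# Full regression of mpg on wt with cyl "fixed effects".
full <- lm(mpg ~ wt + factor(cyl), mtcars)

# Partial cyl out of both mpg and wt, then regress residuals on residuals.
r_y <- resid(lm(mpg ~ factor(cyl), mtcars))
r_x <- resid(lm(wt  ~ factor(cyl), mtcars))

coef(lm(r_y ~ r_x))["r_x"] # identical to coef(full)["wt"]
```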

Linear Probability Model

So far in this module, we have learned about binary explanatory variables. However, what if your outcome variable is binary? We can still use OLS for this!

Let’s consider some data on police treatment of individuals arrested in Toronto for simple possession of small quantities of marijuana (data, documentation). In these data, we can observe whether the arrestee was released, along with their race, age, sex, employment status, citizenship status, and number of previous police checks.

Let’s build a model to explain whether the arrestee was released given some of these controls.

Code
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/carData/Arrests.csv")
df$released <- ifelse(df$released == "Yes", 1, 0)
df$black <- ifelse(df$colour == "Black", 1, 0)
df$female <- ifelse(df$sex == "Female", 1, 0)
df$citizen <- ifelse(df$citizen == "Yes", 1, 0)
df$employed <- ifelse(df$employed == "Yes", 1, 0)
summary(r_lpm <- lm(released ~ female + citizen + employed + age + black + checks, df))
Output

Call:
lm(formula = released ~ female + citizen + employed + age + black + 
    checks, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.98272  0.03631  0.09415  0.19252  0.60548 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.7405159  0.0238808  31.009  < 2e-16 ***
female      -0.0034265  0.0179731  -0.191    0.849    
citizen      0.0892898  0.0143901   6.205 5.89e-10 ***
employed     0.1236378  0.0125926   9.818  < 2e-16 ***
age          0.0004879  0.0006049   0.807    0.420    
black       -0.0573499  0.0119889  -4.784 1.77e-06 ***
checks      -0.0499783  0.0033967 -14.714  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3581 on 5219 degrees of freedom
Multiple R-squared:  0.09551,   Adjusted R-squared:  0.09447 
F-statistic: 91.85 on 6 and 5219 DF,  p-value: < 2.2e-16
Coefficient Interpretation
  1. The coefficient for female suggests that females are 0.3 percentage points less likely than males to be released following an arrest for possession of small amounts of marijuana. However, this coefficient is not statistically different from zero, so we do not have evidence that this difference is meaningful.
  2. The estimate for citizen suggests that, relative to non-citizens, citizens are 8.9 percentage points more likely to be released.
  3. Similarly, those who are employed are 12.4 percentage points more likely to be released than those who are unemployed.
  4. An additional year of age is associated with an (insignificant) 0.05 percentage point increase in the probability of being released.
  5. Individuals who are Black, controlling for other factors, are 5.7 percentage points less likely to be released following an arrest.
  6. Each additional previous police check (checks) reduces the probability of being released by 5 percentage points.
  7. The intercept suggests a baseline probability of release of 74.1%. Note that this applies when all other variables are equal to zero. Specifically, this says that the predicted probability of release for an unemployed White male non-citizen with no previous checks and an age of zero is 74.1%.
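To see how these coefficients combine into a prediction, consider a worked example using the (rounded) estimates above for a hypothetical arrestee: an employed Black male citizen, age 25, with 2 previous checks.

```r
# Coefficients copied (rounded) from the regression output above.
b <- c(intercept = 0.7405, female = -0.0034, citizen = 0.0893,
       employed = 0.1236, age = 0.0005, black = -0.0573, checks = -0.0500)

# Hypothetical arrestee: male, citizen, employed, age 25, Black, 2 checks.
x <- c(1, 0, 1, 1, 25, 1, 2)

sum(b * x) # predicted probability of release, about 0.81
```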
Are these results causal?

An interesting, albeit unsurprising, result of this regression is that black Canadians are less likely to be released than white Canadians. Can we interpret this coefficient as causal? Recall the four reasons why we might find a correlation:

  1. The causal interpretation: race causes release. To settle on this one, we need to rule out the other three.
  2. The reverse causality interpretation: release causes race. This one is clearly implausible, and can be promptly ruled out.
  3. The third variable interpretation: is there some third variable that is causing both race and release? Unfortunately, given this setting/regression, we cannot rule out there being some third factor that is correlated with race and whether an individual is released following an arrest.
  4. The random chance interpretation: we found this correlation by “accident”. This is unlikely given the number of observations we have and our statistical tests. We can probably rule this out.
If we could figure out a clever way (e.g. a Randomized Control Trial) to rule out all possible third variables, we could make causal claims. Unfortunately, we will not cover that in this course. However, if you want to stick around and take Econ 400 in the Spring, we’ll get to things like this!

Linear Probability Model

For variables with few unique values, like checks, we can transform them into factors even though they are not technically categorical. This can help us capture non-linear effects. Below are estimated coefficients from a linear probability model with checks fixed effects, along with a plot of the coefficients and their 95% confidence intervals.

Code
r_ck <- lm(released ~ black + female + citizen + employed + age + as.factor(checks), df)
summary(r_ck)
Output

Call:
lm(formula = released ~ black + female + citizen + employed + 
    age + as.factor(checks), data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96115  0.05103  0.09370  0.18417  0.57696 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         0.7313119  0.0240721  30.380  < 2e-16 ***
black              -0.0605972  0.0119997  -5.050 4.57e-07 ***
female              0.0001721  0.0179659   0.010   0.9924    
citizen             0.0884361  0.0144127   6.136 9.09e-10 ***
employed            0.1221697  0.0125839   9.708  < 2e-16 ***
age                 0.0003205  0.0006079   0.527   0.5981    
as.factor(checks)1 -0.0155505  0.0148940  -1.044   0.2965    
as.factor(checks)2 -0.0452317  0.0154331  -2.931   0.0034 ** 
as.factor(checks)3 -0.1315396  0.0148573  -8.854  < 2e-16 ***
as.factor(checks)4 -0.2152762  0.0168947 -12.742  < 2e-16 ***
as.factor(checks)5 -0.2534383  0.0330511  -7.668 2.07e-14 ***
as.factor(checks)6 -0.1208529  0.1200205  -1.007   0.3140    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3575 on 5214 degrees of freedom
Multiple R-squared:  0.09935,   Adjusted R-squared:  0.09745 
F-statistic: 52.29 on 11 and 5214 DF,  p-value: < 2.2e-16
Plot

Reporting Coefficients

When reporting regression models, readers are usually uninterested in the fixed effect coefficients themselves. Instead, it is common to include table rows that indicate which fixed effects are included in which models. As an example, let’s consider the housing price and personal income regressions.

Code
fe1 <- lm(hp ~ value000, pi)
fe2 <- lm(hp ~ value000 + as.factor(fips) + as.factor(year), pi)

regz <- list(`Housing Price Index` = fe1,
             `Housing Price Index` = fe2)
coefz <- c("value000" = "Personal Income (in Thousands)")
gofz <- c("nobs", "r.squared")

rowz <- as.data.frame(matrix(c('Fixed Effects', "No", "Yes",
                               "Something Else", "A", "B"),
                             byrow = T, nrow = 2)) # 2 rows
attr(rowz, 'position') <- c(3, 4) # rows 3 and 4

modelsummary(regz,
             title = "Effect of Personal Income on Housing Price Index",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz,
             add_rows = rowz)
Effect of Personal Income on Housing Price Index
 Housing Price Index  Housing Price Index
Personal Income (in Thousands) 3.699*** 0.495***
(0.077) (0.130)
Fixed Effects No Yes
Something Else A B
Num.Obs. 391 391
R2 0.856 0.991